Abstract

This paper describes a free/open-source implementation of the light sliding-window (LSW) part-of-speech tagger for the Apertium 
free/open-source machine translation platform. Firstly, the mechanism and training process of the tagger are reviewed, and a new method 
for incorporating linguistic rules is proposed. Secondly, experiments are conducted to compare the performances of the tagger under 
different window settings, with or without Apertium-style “forbid” rules, with or without Constraint Grammar, and also with respect to 
the traditional HMM tagger in Apertium. 
1. Introduction

Apertium^1 is a shallow-transfer rule-based free/open-source machine translation platform. This paper reports a free/open-source implementation of the light sliding-window (LSW) PoS tagger (Sánchez-Villamil et al., 2005), and compares its performance with that of the original first-order HMM tagger in Apertium (Tyers et al., 2010; Sheikh and Sánchez-Martínez, 2009; Cutting et al., 1992). Section 2 reviews the mechanism of the LSW tagger and proposes a method to improve its tagging accuracy by incorporating linguistic rules. Section 3 shows the experimental results and discusses them, and finally, in Section 4, the paper ends with some conclusions and future plans.

2. Methods 

The main difference between the LSW and HMM PoS taggers is that the LSW PoS tagger makes local decisions about the PoS tag of each word, based on the ambiguity classes (sets of PoS tags) of the words in a fixed-length context around the problem word, while the HMM tagger makes this decision by efficiently considering all possible disambiguations of all words in the sentence, using a probabilistic model based on a multiplicative chain of transition and emission probabilities. In terms of model complexity, LSW is simpler than HMM; on the other hand, the number of parameters of LSW could be larger than that of HMM, which may have a crucial influence on tagging performance, as the training material may not be sufficient to estimate them adequately.
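To make the contrast concrete, the following sketch shows the kind of sentence-level search a first-order HMM tagger performs. It uses the Viterbi algorithm, which is the standard dynamic programme for this task; the text above does not name it, so treat that choice as an assumption. The tags, the ambiguity-class labels used as observations, and every probability are invented for illustration and are not taken from Apertium.

```python
def viterbi(obs, tags, start, trans, emit):
    """Most probable tag sequence under a first-order HMM given as nested dicts."""
    # best[t][g] = (probability of the best path ending in tag g at position t,
    #               tag chosen at position t - 1 on that path)
    best = [{g: (start[g] * emit[g].get(obs[0], 0.0), None) for g in tags}]
    for o in obs[1:]:
        column = {}
        for g in tags:
            # Best predecessor for tag g: transition probability times path score.
            prev = max(tags, key=lambda p: best[-1][p][0] * trans[p][g])
            score = best[-1][prev][0] * trans[prev][g] * emit[g].get(o, 0.0)
            column[g] = (score, prev)
        best.append(column)
    # Backtrack from the best final tag.
    path = [max(tags, key=lambda g: best[-1][g][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Hypothetical toy model (all numbers invented).
TAGS = ["det", "noun", "verb"]
START = {"det": 0.6, "noun": 0.3, "verb": 0.1}
TRANS = {
    "det": {"det": 0.05, "noun": 0.9, "verb": 0.05},
    "noun": {"det": 0.1, "noun": 0.3, "verb": 0.6},
    "verb": {"det": 0.5, "noun": 0.4, "verb": 0.1},
}
EMIT = {
    "det": {"DET": 1.0},
    "noun": {"NOUN": 0.5, "NOUN|VERB": 0.5},
    "verb": {"VERB": 0.4, "NOUN|VERB": 0.6},
}

result = viterbi(["DET", "NOUN|VERB", "NOUN|VERB"], TAGS, START, TRANS, EMIT)
print(result)  # ['det', 'noun', 'verb']
```

Note how the tag chosen for the second word depends on the whole sentence through the chain of transition probabilities, whereas a sliding-window tagger would look only at a fixed number of neighbours.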

The LSW tagger is an improved version of the sliding-window (SW) PoS tagger (Sánchez-Villamil et al., 2004), and its main goal is to reduce the number of parameters of the SW tagger, by using approximations for the parameter estimation, without a significant loss in accuracy. Therefore, we briefly describe the SW tagger first, and then the LSW tagger.


^1 The Apertium machine translation engine, linguistic data for various language pairs, and documentation can be downloaded from http://www.apertium.org


2.1. The SW tagger 

2.1.1. Overview 

Let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ be the tag set, and $W = \{w_1, w_2, \ldots\}$ be the words to be tagged. A partition of $W$ is established so that $w_i \equiv w_j$ if and only if both are assigned the same subset of tags, where each class of the partition is called an ambiguity class. Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$ be the collection of ambiguity classes, where each $\sigma_i$ is an ambiguity class. Let $T : \Sigma \rightarrow 2^{\Gamma}$ be the function returning the collection $T(\sigma)$ of PoS tags for an ambiguity class $\sigma$.
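As a small illustration of these definitions, the sketch below partitions a toy lexicon into ambiguity classes. The lexicon, the tag names and the choice to represent an ambiguity class directly as a frozen set of tags are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> set of possible PoS tags.
LEXICON = {
    "the": {"det"},
    "a": {"det"},
    "book": {"noun", "verb"},
    "walk": {"noun", "verb"},
    "light": {"adj", "noun", "verb"},
}


def ambiguity_classes(lexicon):
    """Partition the words so that two words fall together iff they share a tag set."""
    classes = defaultdict(set)
    for word, tags in lexicon.items():
        classes[frozenset(tags)].add(word)
    return dict(classes)


def T(sigma):
    """Tags of an ambiguity class; trivial here because a class is stored as its tag set."""
    return set(sigma)


classes = ambiguity_classes(LEXICON)
for sigma, words in classes.items():
    print(sorted(sigma), "<-", sorted(words))
```

With this toy lexicon there are three ambiguity classes, and "book" and "walk" are indistinguishable to a tagger that only sees ambiguity classes.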

The PoS tagging problem may be formulated as follows: given a text $w[1] w[2] \ldots w[L] \in W^{+}$, each word $w[t]$ is assigned (using a lexicon and a morphological analyzer) an ambiguity class $\sigma[t] \in \Sigma$ to obtain the ambiguously tagged text $\sigma[1] \sigma[2] \ldots \sigma[L] \in \Sigma^{+}$; the task of a PoS tagger is to obtain a tag sequence $\gamma[1] \gamma[2] \ldots \gamma[L] \in \Gamma^{+}$ as correct as possible, that is, the one that maximizes the probability of that tag sequence given the word sequence:

$$\gamma^{*}[1] \ldots \gamma^{*}[L] = \operatorname*{argmax}_{\gamma[t] \in T(\sigma[t])} P(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L]) \tag{1}$$

The core idea of SW PoS tagging is to use the ambiguity classes of neighboring words to approximate the dependencies locally:

$$P(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L]) = \prod_{t=1}^{L} p(\gamma[t] \mid C_{(-)}\, \sigma[t]\, C_{(+)}) \tag{2}$$

where $t = 1 \ldots L$, $C_{(-)}$ is the left context of length $N_{(-)}$ (e.g. if $N_{(-)} = 1$, then $C_{(-)} = \sigma[t-1]$), and $C_{(+)}$ is the right context of length $N_{(+)}$.
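The windows that Equation (2) conditions on can be extracted as in the sketch below. The padding class used at sentence edges and the function name `windows` are assumptions of this example; the text does not say how boundaries are handled.

```python
BOUNDARY = frozenset({"<s>"})  # hypothetical padding class for sentence edges


def windows(classes, n_left=1, n_right=1):
    """Yield (left context, sigma[t], right context) for every position t.

    `classes` is the ambiguously tagged sentence: one ambiguity class per word.
    """
    classes = list(classes)
    padded = [BOUNDARY] * n_left + classes + [BOUNDARY] * n_right
    for t in range(n_left, n_left + len(classes)):
        left = tuple(padded[t - n_left:t])
        right = tuple(padded[t + 1:t + 1 + n_right])
        yield left, padded[t], right


DET = frozenset({"det"})
NV = frozenset({"noun", "verb"})
NOUN = frozenset({"noun"})
sentence = [DET, NV, NOUN]

for left, sigma, right in windows(sentence, n_left=1, n_right=1):
    print([sorted(c) for c in left], sorted(sigma), [sorted(c) for c in right])
```

Calling `windows(sentence, n_left=2, n_right=0)` gives the purely left-looking setting that Section 3 denotes LSW(-2, -1) for the light tagger.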

2.1.2. Unsupervised parameter estimation 

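Let $p(\gamma \mid C_{(-)}\, \sigma\, C_{(+)})$ be the probability of a tag $\gamma$ appearing between the contexts $C_{(-)}$ and $C_{(+)}$. The most probable tag $\gamma^{*}[t]$ is selected as the one with the highest probability by the formula:

$$\gamma^{*}[t] = \operatorname*{argmax}_{\gamma \in T(\sigma[t])} p(\gamma \mid C_{(-)}\, \sigma[t]\, C_{(+)}) \tag{3}$$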

Estimating the parameters from a tagged corpus would be straightforward, but estimating from an untagged corpus requires an iterative process. Let $\tilde{n}_{C_{(-)} \gamma C_{(+)}}$ (a simpler and interchangeable representation for $p(\gamma \mid C_{(-)}\, \sigma\, C_{(+)})$) be the effective number of times (count) that $\gamma$ appears between the contexts $C_{(-)}$ and $C_{(+)}$. Following the steps in (Sánchez-Villamil et al., 2004), we can estimate $\tilde{n}_{C_{(-)} \gamma C_{(+)}}$ iteratively by:

$$\tilde{n}^{[k]}_{C_{(-)} \gamma C_{(+)}} = \tilde{n}^{[k-1]}_{C_{(-)} \gamma C_{(+)}} \sum_{\sigma : \gamma \in T(\sigma)} \frac{n_{C_{(-)} \sigma C_{(+)}}}{\sum_{\gamma' \in T(\sigma)} \tilde{n}^{[k-1]}_{C_{(-)} \gamma' C_{(+)}}} \tag{4}$$

A recommended initial value could be obtained by assuming that all the tags $\gamma$ in $\sigma$ are equally probable.
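The iterative estimation and the tag selection of Equation (3) can be sketched as follows for a window with one word of context on each side. This is a minimal illustration, not the Apertium implementation: it assumes that n is the observed count of an ambiguity class between two context classes in the untagged corpus, it pads sentence edges with a hypothetical boundary class, and the toy corpus is invented.

```python
from collections import Counter

BOUNDARY = frozenset({"<s>"})  # hypothetical padding class for sentence edges


def window_counts(sentences):
    """n[(C-, sigma, C+)]: how often class sigma occurs between classes C- and C+."""
    n = Counter()
    for sent in sentences:
        padded = [BOUNDARY] + list(sent) + [BOUNDARY]
        for t in range(1, len(padded) - 1):
            n[(padded[t - 1], padded[t], padded[t + 1])] += 1
    return n


def initial_counts(n):
    """Initial estimate: the tags of each ambiguity class share its count equally."""
    est = Counter()
    for (left, sigma, right), count in n.items():
        for gamma in sigma:
            est[(left, gamma, right)] += count / len(sigma)
    return est


def reestimate(n, est):
    """One iteration: split each window count in proportion to the current estimates."""
    new = Counter()
    for (left, sigma, right), count in n.items():
        total = sum(est[(left, g, right)] for g in sigma)
        if total == 0:
            continue
        for gamma in sigma:
            new[(left, gamma, right)] += count * est[(left, gamma, right)] / total
    return new


def tag(sentence, est):
    """For each word, pick the tag of its class with the largest estimated count."""
    padded = [BOUNDARY] + list(sentence) + [BOUNDARY]
    return [
        max(sorted(padded[t]), key=lambda g: est[(padded[t - 1], g, padded[t + 1])])
        for t in range(1, len(padded) - 1)
    ]


DET = frozenset({"det"})
NOUN = frozenset({"noun"})
NV = frozenset({"noun", "verb"})
corpus = [[DET, NOUN]] * 3 + [[DET, NV]] * 2  # invented untagged corpus

n = window_counts(corpus)
est = initial_counts(n)
for _ in range(10):
    est = reestimate(n, est)

print(tag([DET, NV], est))  # ['det', 'noun']
```

In this toy corpus the unambiguous nouns seen after a determiner pull the shared count for "noun" in that context upwards, so the ambiguous noun/verb class is resolved as a noun in the same context.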

2.2. The LSW tagger 
2.2.1. Overview 

The SW tagger tags a word by looking at the ambiguity classes of neighboring words, and therefore has a number of parameters in $O(|\Sigma|^{N_{(-)} + N_{(+)}} |\Gamma|)$. The LSW tagger (Sánchez-Villamil et al., 2005) tags a word by looking at the possible tags of neighboring words, and therefore has a number of parameters in $O(|\Gamma|^{N_{(-)} + N_{(+)} + 1})$. Usually the tag set size $|\Gamma|$ is significantly smaller than the size $|\Sigma|$ of the collection of ambiguity classes, which are combinations of tags. In this way, the number of parameters is effectively reduced.
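A quick calculation shows the scale of the reduction. The sketch assumes one parameter per combination of context and tag for the SW tagger, and one per tag sequence spanning the window for the LSW tagger; these counts follow from the parameter definitions used in this section, and the sizes plugged in are hypothetical rather than measured on any Apertium language pair.

```python
def sw_parameters(n_classes, n_tags, n_left=1, n_right=1):
    """SW: one count per (context of ambiguity classes, tag) combination."""
    return n_classes ** (n_left + n_right) * n_tags


def lsw_parameters(n_tags, n_left=1, n_right=1):
    """LSW: one count per tag sequence covering the whole window."""
    return n_tags ** (n_left + n_right + 1)


# Hypothetical sizes, chosen only for illustration.
N_CLASSES, N_TAGS = 100, 20

print(sw_parameters(N_CLASSES, N_TAGS))  # 200000
print(lsw_parameters(N_TAGS))            # 8000
```

The saving depends entirely on the tag set being much smaller than the number of ambiguity classes; with a large tag set the LSW count grows quickly with the window length.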

The LSW approximates the best tag as follows:

$$\gamma^{*}[t] = \operatorname*{argmax}_{\gamma \in T(\sigma[t])} \sum_{\substack{E_{(-)} \in T'(C_{(-)}[t]) \\ E_{(+)} \in T'(C_{(+)}[t])}} p(E_{(-)}\, \gamma\, E_{(+)} \mid C_{(-)}[t]\, \sigma[t]\, C_{(+)}[t]) \tag{5}$$

where $T' : \Sigma^{*} \rightarrow 2^{\Gamma^{*}}$, an extension of $T$, returns the set of tag sequences for an ambiguity sequence; and $E_{(-)}$ and $E_{(+)}$ are the left and right tag sequence respectively.

2.2.2. Unsupervised parameter estimation

Following a procedure similar to that for the SW tagger, we can derive an iterative process to train the LSW tagger.

$$\tilde{n}^{[k]}_{E_{(-)} \gamma E_{(+)}} = \tilde{n}^{[k-1]}_{E_{(-)} \gamma E_{(+)}} \sum_{\substack{\sigma : \gamma \in T(\sigma) \\ C_{(-)} : E_{(-)} \in T'(C_{(-)}) \\ C_{(+)} : E_{(+)} \in T'(C_{(+)})}} \frac{n_{C_{(-)} \sigma C_{(+)}}}{\displaystyle\sum_{\substack{\gamma' \in T(\sigma) \\ E'_{(-)} \in T'(C_{(-)}) \\ E'_{(+)} \in T'(C_{(+)})}} \tilde{n}^{[k-1]}_{E'_{(-)} \gamma' E'_{(+)}}} \tag{6}$$

where $\tilde{n}_{E_{(-)} \gamma E_{(+)}}$ is the effective number of times (count) that $\gamma$ appears between the context of tags $E_{(-)}$ and $E_{(+)}$. Similarly to the initialization step in the SW tagger, a recommended initial value can be obtained by assuming that all the tag sequences in the window $C_{(-)}\, \sigma\, C_{(+)}$ are equally probable.

2.3. LSW with forbid and enforce rules

There are forbid and enforce rules for sequences of two PoS tags in the current implementation of the Apertium PoS tagger. They were successfully applied in the original HMM tagger in Apertium, with a significant improvement in accuracy (Sheikh and Sánchez-Martínez, 2009), simply by making the corresponding transition probabilities equal to zero. The SW tagger could not make use of forbid and enforce rules because it works with ambiguity classes, while, on the other hand, the LSW tagger can easily incorporate them as it works directly with PoS tags.

The rules can be introduced right after the initialization step. For a tag sequence in the parameter space, if any two consecutive tags match a forbid rule or fail to match an enforce rule, the underlying parameter $\tilde{n}_{E_{(-)} \gamma E_{(+)}}$ will be given a starting value of zero.

In this way, for an LSW tagger with rules, the initial value could be given as follows.

$$\tilde{n}^{[0]}_{E_{(-)} \gamma E_{(+)}} = \begin{cases} 0 & \text{if } E_{(-)} \gamma E_{(+)} \text{ is not valid,} \\ \displaystyle\sum_{\substack{\sigma : \gamma \in T(\sigma) \\ C_{(-)} : E_{(-)} \in T'(C_{(-)}) \\ C_{(+)} : E_{(+)} \in T'(C_{(+)})}} \frac{n_{C_{(-)} \sigma C_{(+)}}}{|V(C_{(-)}\, \sigma\, C_{(+)})|} & \text{otherwise} \end{cases}$$

where the validity of $E_{(-)} \gamma E_{(+)}$ is determined by forbid and enforce rules, and the function $V$ returns the collection of valid (enforced or not forbidden) tag sequences contained in the ambiguity class sequence $C_{(-)}\, \sigma\, C_{(+)}$.

Items                  Spanish      Catalan      English
Words (train)          3 million    4 million    3 million
Amb. classes (train)   106          92           68
Words (test)           25,000       25,000       30,000
Amb. rate (test)       22.81%       31.13%       29.97%
Forbid rules           545          272          117
Enforce rules          15           25           41

Table 1: Major statistics for the training and test data.
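The LSW training of Section 2.2.2, the rule-based initialization of Section 2.3 and the tag selection of Section 2.2.1 can be sketched together for a window with one word of context on each side. This is a sketch under stated assumptions rather than the Apertium code: forbid rules are modelled as a set of tag pairs, enforce rules as a hypothetical mapping from a tag to the only tags allowed to follow it, sentence edges use an invented boundary class, and the window counts are toy data.

```python
from collections import Counter
from itertools import product


def tag_windows(window):
    """All tag sequences compatible with a window of ambiguity classes (the role of T')."""
    return list(product(*(sorted(c) for c in window)))


def is_valid(tags, forbid, enforce):
    """Check every pair of consecutive tags against the forbid and enforce rules."""
    for first, second in zip(tags, tags[1:]):
        if (first, second) in forbid:
            return False
        if first in enforce and second not in enforce[first]:
            return False
    return True


def initial_counts(n, forbid=frozenset(), enforce=None):
    """Valid tag sequences of a window share its count equally; invalid ones stay at zero."""
    enforce = enforce or {}
    est = Counter()
    for window, count in n.items():
        valid = [e for e in tag_windows(window) if is_valid(e, forbid, enforce)]
        for e in valid:
            est[e] += count / len(valid)
    return est


def reestimate(n, est):
    """One iteration: split each window count in proportion to the current estimates."""
    new = Counter()
    for window, count in n.items():
        candidates = tag_windows(window)
        total = sum(est[e] for e in candidates)
        for e in candidates:
            if est[e] > 0:  # a zero estimate stays zero, so rules persist
                new[e] += count * est[e] / total
    return new


def best_tag(left, sigma, right, est):
    """Sum the estimates over all context tag sequences, then take the best middle tag.

    The normalising constant of the window is the same for every candidate tag,
    so comparing unnormalised sums gives the same argmax.
    """
    score = Counter()
    for e in tag_windows((left, sigma, right)):
        score[e[1]] += est[e]
    return max(sorted(sigma), key=lambda g: score[g])


B = frozenset({"<s>"})  # hypothetical boundary class
DET = frozenset({"det"})
NV = frozenset({"noun", "verb"})
# Invented window counts n[(C-, sigma, C+)].
n = Counter({(B, DET, NV): 4, (DET, NV, NV): 4, (NV, NV, B): 4})

FORBID = {("det", "verb")}  # hypothetical rule: a verb may not follow a determiner
est = initial_counts(n, forbid=FORBID)
for _ in range(5):
    est = reestimate(n, est)

print(best_tag(DET, NV, NV, est))  # noun
```

Because the update is multiplicative, a tag sequence that starts at zero can never regain probability mass, which is why setting the initial value is enough to enforce a rule throughout training.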


3. Experiments 

3.1. Training data and test set 

The experiments are conducted on three languages: Spanish (apertium-en-es-0.8.0), Catalan (apertium-es-ca-1.1.0), and English (apertium-en-es-0.8.0). We obtain the training data for Spanish and English by sampling text from the Europarl corpus (Koehn, 2005), and for Catalan by sampling text from the Catalan Wikipedia. The statistics on the training data and test data are shown in Table 1. Test data for Catalan and Spanish come from apertium-es-ca-1.1.0. It is worth noting that the English test set has been built by mapping the results from the TnT (Brants, 2000) tagger as an approximation.

3.2. The LSW tagger vs. the SW tagger 

We first study whether there is a difference between the LSW tagger and the SW tagger, keeping all other settings the same. Then we study whether rules can help improve the accuracy of the LSW tagger. “Accuracy” in the graphs refers to the tagging precision of a tagger on the hand-tagged test set. Figure 1 shows that rules significantly improve accuracy, and that the SW tagger behaves similarly to the LSW tagger without rules, which is consistent with the conclusion in (Sánchez-Villamil et al., 2005).
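The accuracy measure can be computed as in the sketch below. It assumes that accuracy is taken over all tokens of the test set (the text does not say whether unambiguous words are excluded), and the tag sequences are invented.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the hand-assigned tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold sequences must have the same length")
    if not gold:
        raise ValueError("the test set is empty")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Invented example: one tagging error in five tokens.
gold = ["det", "noun", "verb", "det", "noun"]
predicted = ["det", "noun", "noun", "det", "noun"]

print(f"{accuracy(predicted, gold):.2%}")  # 80.00%
```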



[Plot panels, one per language: accuracy against the number of training lines (e.g. Spanish Europarl Lines, English Europarl Lines); curves: LSW(-1,+1), LSW(-1,+1)-No-Rules, SW(-1,+1).]

Figure 1: Performance evaluation for (1) the LSW(-1, +1) tagger, (2) the LSW(-1, +1) tagger without rules, denoted as LSW(-1, +1)-No-Rules, and (3) the SW(-1, +1) tagger, all on Spanish, Catalan, and English.


3.3. Different window settings for the LSW tagger 

We study the performance of the LSW tagger with different window settings, and of the HMM tagger, on the three languages, as shown in Figure 2. We can see that the HMM tagger performs best among all the taggers, especially when there is enough training data. However, when training data is limited, the LSW taggers learn faster (need fewer words to learn) and more stably than the HMM tagger.

Among all the LSW taggers, LSW(-1, +1), i.e. left context 1 and right context 1, performs best. When there is enough training data, the performances of the HMM tagger and the LSW(-1, +1) tagger are quite close.

Note that under some window settings, the performance of the LSW taggers even decreases as more training lines are added, e.g. LSW(-1) and LSW(-2, -1) for Spanish and Catalan. This is an unexpected phenomenon, and the reason for it requires further investigation.



[Plot panels, one per language: accuracy against the number of training lines (Spanish Europarl Lines, Catalan Wikipedia Lines, English Europarl Lines); curves: HMM-Bigram, LSW(-1,+1), LSW(-1), LSW(+1), LSW(-2,-1), LSW(+1,+2), LSW(-2,-1,+1), LSW(-1,+1,+2).]

Figure 2: Different window settings and their performance, tested on Spanish, Catalan, and English.


3.4. Using Constraint Grammar rules to support the HMM and LSW

We also tested whether the use of Constraint Grammar (CG) rules helps to improve the accuracy obtained by both the HMM and LSW taggers, along the lines suggested in (Hulden and Francom, 2012). For that, we used the CG rules already present in the Apertium packages apertium-eo-es-0.8.2 for Spanish and apertium-eo-ca-0.8.2 for Catalan (a CG module is integrated in many Apertium language pairs). Figure 3 shows that CG helps in almost all settings. It also shows that CG rules help the two taggers in different situations: for the HMM tagger, the positive contribution of CG rules is larger when training data is limited than when it is relatively abundant; for the LSW tagger, the trend is almost the opposite, with CG rules contributing even more when training data is relatively abundant. Note that the logical approach would be to use CG rules both for reducing the ambiguity of the training corpus (denoted as cgTrain in Figure 3) and for reducing ambiguity right after the morphological analyzer and before the PoS tagger (denoted as cgTag in Figure 3); the results are, however, almost indistinguishable from those obtained by applying CG in either step.

[Plot panels for the HMM tagger and for the LSW tagger, each on Spanish and Catalan: accuracy against the number of training lines (Spanish Europarl Lines, Catalan Wikipedia Lines); curves per tagger: plain, -cgTrain, -cgTag, -cgTrain-cgTag.]

Figure 3: Performance evaluation for HMM and LSW with and without CG.


4. Discussion and future work 


We reviewed the mechanism and unsupervised parameter estimation methods for both the SW and LSW taggers. Compared with previous work (Sánchez-Villamil et al., 2004; Sánchez-Villamil et al., 2005), firstly, we proposed a method for incorporating the forbid and enforce rules already used for HMM taggers in Apertium into the LSW tagger; and secondly, this implementation is the first to integrate the LSW tagger into a real machine translation system (Apertium), and its code is free/open-source.

We also conducted experiments to compare the performance of the LSW tagger under different settings, and with respect to the original HMM tagger. Firstly, the HMM tagger performs slightly better than the LSW(-1, +1) tagger when there is enough training data, while the LSW(-1, +1) tagger learns faster and is more stable when training data is limited. Secondly, the LSW(-1, +1) tagger performs best among all window settings, and better than the SW(-1, +1) tagger, which behaves similarly to LSW(-1, +1)-No-Rules. Thirdly, we have found that the CG rule sets already existing in some Apertium taggers significantly improve the accuracy of both the HMM and LSW taggers, and that for the HMM tagger CG rules help more when training data is limited, while for the LSW tagger CG rules help more when training data is relatively abundant.

The reason why the performance of the LSW tagger under some window settings worsens as more training lines are added also requires further study. Source code is available through the Apertium Subversion repository under a free/open-source license.